Customer Churn Analysis

Churn rate, when applied to a customer base, refers to the proportion of contractual customers or subscribers who leave a supplier during a given time period. It is a possible indicator of customer dissatisfaction, cheaper and/or better offers from the competition, more successful sales and/or marketing by the competition, or reasons having to do with the customer life cycle.

Churn is closely related to the concept of average customer lifetime. For example, an annual churn rate of 25 percent implies an average customer life of four years. An annual churn rate of 33 percent implies an average customer life of three years. The churn rate can be minimized by creating barriers which discourage customers to change suppliers (contractual binding periods, use of proprietary technology, value-added services, unique business models, etc.), or through retention activities such as loyalty programs. It is possible to overstate the churn rate, as when a consumer drops the service but then restarts it within the same year. Thus, a clear distinction needs to be made between "gross churn", the total number of absolute disconnections, and "net churn", the overall loss of subscribers or members. The difference between the two measures is the number of new subscribers or members that have joined during the same period. Suppliers may find that if they offer a loss-leader "introductory special", it can lead to a higher churn rate and subscriber abuse, as some subscribers will sign on, let the service lapse, then sign on again to take continuous advantage of current specials. https://en.wikipedia.org/wiki/Churn_rate


In [1]:
%%capture

# Get our favorite packages from PyPI
! pip install cufflinks

# Import pre-installed packages
import numpy as np
import pandas as pd

# Suppress unwatned warnings
import warnings
warnings.filterwarnings('ignore')
import logging
logging.getLogger("requests").setLevel(logging.WARNING)

In [2]:
# Load our favorite visualization library
import os
import plotly
import plotly.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
import cufflinks as cf
plotly.offline.init_notebook_mode(connected=True)

# Sign into Plotly with masked, encrypted API key

myPlotlyKey = os.environ['SECRET_ENV_BRETTS_PLOTLY_KEY']
py.sign_in(username='bretto777',api_key=myPlotlyKey)



In [3]:
# Load some data
churnDF = pd.read_csv('https://s3-us-west-1.amazonaws.com/dsclouddata/home/jupyter/churn_train.csv', delimiter=',')
churnDF["Churn"] = churnDF["Churn"].replace(to_replace=False, value='Retain')
churnDF["Churn"] = churnDF["Churn"].replace(to_replace=True, value='Churn')
churnDFs = churnDF.sample(frac=0.07) # Sample for speedy viz
churnDF.head(10)


Out[3]:
State Account Length Area Code Phone Int'l Plan VMail Plan VMail Message Day Mins Day Calls Day Charge ... Eve Calls Eve Charge Night Mins Night Calls Night Charge Intl Mins Intl Calls Intl Charge CustServ Calls Churn
0 ND 84 415 400-7253 no yes 33 159.1 106 27.05 ... 101 12.73 213.4 108 9.60 13.0 18 3.51 1 Retain
1 RI 117 408 370-5042 no yes 13 207.6 65 35.29 ... 77 12.98 232.8 95 10.48 9.7 3 2.62 1 Retain
2 VA 132 510 343-4696 no no 0 81.1 86 13.79 ... 72 20.84 237.0 115 10.67 10.3 2 2.78 0 Retain
3 OK 121 408 364-2495 no yes 31 237.1 63 40.31 ... 117 17.48 196.7 85 8.85 10.1 5 2.73 4 Retain
4 ME 205 510 413-4039 no yes 24 175.8 139 29.89 ... 98 13.18 180.7 64 8.13 7.8 5 2.11 2 Retain
5 MT 116 415 384-5907 no yes 35 182.8 122 31.08 ... 119 18.08 193.8 103 8.72 11.0 2 2.97 1 Retain
6 MN 113 408 417-5146 no no 0 158.9 137 27.01 ... 109 20.64 247.8 97 11.15 6.5 4 1.76 0 Retain
7 MO 74 415 421-2955 no no 0 172.1 105 29.26 ... 99 17.99 182.2 105 8.20 11.6 6 3.13 1 Retain
8 MI 112 510 420-1383 no no 0 243.4 77 41.38 ... 97 15.48 259.2 94 11.66 12.8 2 3.46 1 Retain
9 UT 68 415 403-8916 no no 0 213.9 112 36.36 ... 100 22.14 233.8 97 10.52 8.4 3 2.27 1 Churn

10 rows × 21 columns


In [4]:
# separate the calls data for plotting


churnDFs = churnDFs[['Account Length','Day Calls','Eve Calls','CustServ Calls','Churn']]

# Create scatter plot matrix of call data
splom = ff.create_scatterplotmatrix(churnDFs, diag='histogram', index='Churn',  
                                  colormap= dict(
                                      Churn = '#9CBEF1',
                                      Retain = '#04367F'
                                      ),
                                  colormap_type='cat',
                                  height=560, width=650,
                                  size=4, marker=dict(symbol='circle'))
py.iplot(splom)


Out[4]:

In [5]:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
from __future__ import print_function
h2o.init(nthreads=1, max_mem_size=2)
h2o.remove_all()


[WARNING] H2O requires requests module of version 2.10 or newer. You have version 2.4.3.
You can upgrade to the newest version of the module running from the command line
    $ pip2 install --upgrade requests
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.7.0_151"; OpenJDK Runtime Environment (IcedTea 2.6.11) (7u151-2.6.11-1~deb8u1); OpenJDK 64-Bit Server VM (build 24.151-b01, mixed mode)
  Starting server from /usr/local/lib/python2.7/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpETtwxj
  JVM stdout: /tmp/tmpETtwxj/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpETtwxj/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: 02 secs
H2O cluster version: 3.14.0.7
H2O cluster version age: 5 days
H2O cluster name: H2O_from_python_unknownUser_0l1b52
H2O cluster total nodes: 1
H2O cluster free memory: 1.778 Gb
H2O cluster total cores: 16
H2O cluster allowed cores: 1
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
H2O API Extensions: AutoML, XGBoost, Algos, Core V3, Core V4
Python version: 2.7.9 final

In [6]:
# Split data into training and testing frames

from sklearn import cross_validation
from sklearn.model_selection import train_test_split

training, testing = train_test_split(churnDF, train_size=0.8, stratify=churnDF["Churn"], random_state=9)
train = h2o.H2OFrame(python_obj=training).drop("State")
test = h2o.H2OFrame(python_obj=testing).drop("State")

# Set predictor and response variables
y = "Churn"
x = train.columns
x.remove(y)


/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:41: DeprecationWarning:

This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%

Super Learner

The super learner is a prediction method designed to find the optimal combination of a collection of prediction algorithms. The super learner algorithm finds the combination of algorithms minimizing the cross-validated risk. The super learner framework is9 built in the theory of cross-validation and allows for a general class of prediction algorithms to be considered for the ensemble. http://biostats.bepress.com/ucbbiostat/paper266/ (Polley & Van der Laan, 2010)


In [20]:
# Reset variables
# del allModels, gridGBM, gridRF, grids, dfGridGBM, dfGridRF, ensemble

In [21]:
# Reset variables
# del SuperModel, BestModel, Model3, Model4, Model5, Model6, Model7, Model8, Model9, Model10

In [7]:
%%time
# GBM hyperparameters 
nfolds = 5
gbm_hyper_params = {"learn_rate":[0.075, 0.1], "nbins":[10,15,20],"ntrees": [20,30,40], "max_depth": [5,7,9], "sample_rate": [0.75, 0.8, 0.85, 0.9]}
search_criteria = {"strategy": "RandomDiscrete", "max_models": 6}

# Setup the GBM grid search
gridGBM = H2OGridSearch(model=H2OGradientBoostingEstimator(balance_classes=True, seed=123, nfolds=nfolds, fold_assignment="Modulo", keep_cross_validation_predictions=True),
                     hyper_params=gbm_hyper_params,
                     search_criteria=search_criteria,
                     grid_id="gbm_grid_binomial")

# Start the GBM training
gridGBM.train(x=x, y=y, training_frame=train)
         

    
# Random Forest hyperparameters
rf_hyper_params = {"mtries":[12,15,18],"nbins":[10,20,30], "ntrees": [25,50,75], "max_depth": [5,7], "sample_rate": [0.75, 0.8, 0.85, 0.9]}
gridRF = H2OGridSearch(model=H2ORandomForestEstimator(balance_classes=True, seed=123, nfolds=nfolds, fold_assignment="Modulo", keep_cross_validation_predictions=True),                     
                      hyper_params=rf_hyper_params,
                      search_criteria=search_criteria,
                      grid_id="rf_grid_binomial")

# Start the Random Forest training
gridRF.train(x=x, y=y, training_frame=train)


# List the GBMs and Random Forests that we wish to ensemble
grids = gridGBM.model_ids + gridRF.model_ids


# Train the super learner
ensemble = H2OStackedEnsembleEstimator(model_id="GBM-RF-ensemble", base_models=grids, training_frame=train, validation_frame=test)
ensemble.train(x=x, y=y, training_frame=train)


# Evaluate ensemble performance on the test data
perf_stack_test = ensemble.model_performance(test)


# Compare the super learner to the base learners. First, combine all the base models into a single list, sorted by auc.
dfGridGBM = pd.DataFrame(data=gridGBM.get_grid(sort_by="auc", decreasing=True).sorted_metric_table())
dfGridRF = pd.DataFrame(data=gridRF.get_grid(sort_by="auc", decreasing=True).sorted_metric_table())
allModels = dfGridGBM.append(dfGridRF)
allModels['auc'] = allModels['auc'].astype('float64')
allModels.sort_values(by="auc", ascending=False, inplace=True)
allModels = allModels.reset_index()
baselearner_best_name = allModels.loc[0,'model_ids']
baselearner_best_auc = allModels.loc[0,'auc']

# Best stacked model auc
stack_auc_test = perf_stack_test.auc()


print("Best Base-learner Test AUC: " + str(baselearner_best_auc))
print("Ensemble Test AUC: " + str(stack_auc_test))


gbm Grid Build progress: |████████████████████████████████████████████████| 100%
drf Grid Build progress: |████████████████████████████████████████████████| 100%
stackedensemble Model Build progress: |███████████████████████████████████| 100%
Best Base-learner Test AUC: 0.907715225921
Ensemble Test AUC: 0.950917068651
CPU times: user 2.01 s, sys: 312 ms, total: 2.32 s
Wall time: 1min 35s

Leaderboard


In [8]:
allModels


Out[8]:
index auc learn_rate max_depth model_ids mtries nbins ntrees sample_rate
0 0 0.907715 0.1 5 gbm_grid_binomial_model_3 NaN 10 20 0.8
1 1 0.906921 0.075 9 gbm_grid_binomial_model_1 NaN 20 40 0.8
2 2 0.905437 0.1 9 gbm_grid_binomial_model_5 NaN 15 40 0.9
3 3 0.904353 0.075 9 gbm_grid_binomial_model_2 NaN 20 30 0.8
4 4 0.903081 0.075 9 gbm_grid_binomial_model_4 NaN 10 30 0.9
5 5 0.902071 0.1 5 gbm_grid_binomial_model_0 NaN 15 20 0.75
6 0 0.901734 NaN 7 rf_grid_binomial_model_0 12 10 50 0.9
7 1 0.900668 NaN 7 rf_grid_binomial_model_5 15 10 50 0.75
8 2 0.898025 NaN 7 rf_grid_binomial_model_3 15 20 50 0.75
9 3 0.893056 NaN 5 rf_grid_binomial_model_1 15 20 75 0.85
10 4 0.886211 NaN 5 rf_grid_binomial_model_2 18 30 50 0.85
11 5 0.879231 NaN 7 rf_grid_binomial_model_4 18 20 50 0.85

Variable Importances

Below we plot variable importances as reported by the best performing algo in the ensemble.


In [9]:
best = h2o.get_model(baselearner_best_name)

importances = best.varimp(use_pandas=True)
importances = importances.loc[:,['variable','relative_importance']].groupby('variable').mean()
importances.sort_values(by="relative_importance", ascending=False).iplot(kind='bar', colors='#5AC4F2', theme='white')


Out[9]:

Super Model vs the Base models

This plot shows the ROC curves for the Super Model, the Best Base Model, and 9 next best models in the ensemble.


In [10]:
SuperModel = np.array(ensemble.roc(valid=True))
BestModel = np.array(h2o.get_model(baselearner_best_name).roc(xval=True))
Model2 = np.array(h2o.get_model(allModels.loc[1,'model_ids']).roc(xval=True))
Model3 = np.array(h2o.get_model(allModels.loc[2,'model_ids']).roc(xval=True))
Model4 = np.array(h2o.get_model(allModels.loc[3,'model_ids']).roc(xval=True))
Model5 = np.array(h2o.get_model(allModels.loc[4,'model_ids']).roc(xval=True))
Model6 = np.array(h2o.get_model(allModels.loc[5,'model_ids']).roc(xval=True))
Model7 = np.array(h2o.get_model(allModels.loc[6,'model_ids']).roc(xval=True))
Model8 = np.array(h2o.get_model(allModels.loc[7,'model_ids']).roc(xval=True))
Model9 = np.array(h2o.get_model(allModels.loc[8,'model_ids']).roc(xval=True))
Model10 = np.array(h2o.get_model(allModels.loc[9,'model_ids']).roc(xval=True))



layout = go.Layout(autosize=False, width=725, height=575,  xaxis=dict(title='False Positive Rate', titlefont=dict(family='Arial, sans-serif', size=15, color='grey')), 
                                                           yaxis=dict(title='True Positive Rate', titlefont=dict(family='Arial, sans-serif', size=15, color='grey')))

SuperModelTrace = go.Scatter(x = SuperModel[0],y = SuperModel[1], mode = 'lines', name = 'Super Model', line = dict(color = ('rgb(26, 58, 126)'), width = 3))
BestModelTrace = go.Scatter(x = BestModel[0],y = BestModel[1], mode = 'lines', name = 'Best Base Model', line = dict(color = ('rgb(135, 160, 216)'), width = 3))
Model2Trace = go.Scatter(x = Model2[0], y = Model2[1], mode = 'lines', name = 'Model 2', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model3Trace = go.Scatter(x = Model3[0], y = Model3[1], mode = 'lines', name = 'Model 3', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model4Trace = go.Scatter(x = Model4[0], y = Model4[1], mode = 'lines', name = 'Model 4', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model5Trace = go.Scatter(x = Model5[0], y = Model5[1], mode = 'lines', name = 'Model 5', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model6Trace = go.Scatter(x = Model6[0], y = Model6[1], mode = 'lines', name = 'Model 6', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model7Trace = go.Scatter(x = Model7[0], y = Model7[1], mode = 'lines', name = 'Model 7', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model8Trace = go.Scatter(x = Model8[0], y = Model8[1], mode = 'lines', name = 'Model 8', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model9Trace = go.Scatter(x = Model9[0], y = Model9[1], mode = 'lines', name = 'Model 9', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model10Trace = go.Scatter(x = Model10[0], y = Model10[1], mode = 'lines', name = 'Model 10', line = dict(color = ('rgb(156, 190, 241)'), width = 1))

traceChanceLine = go.Scatter(x = [0,1], y = [0,1], mode = 'lines+markers', name = 'chance', line = dict(color = ('rgb(136, 140, 150)'), width = 4, dash = 'dash'))

fig = go.Figure(data=[SuperModelTrace,BestModelTrace,Model2Trace,Model3Trace,Model4Trace,Model5Trace,Model7Trace,Model8Trace,Model9Trace,Model10Trace,traceChanceLine], layout=layout)


py.iplot(fig)


Out[10]:

Confusion Matrix


In [11]:
cm = perf_stack_test.confusion_matrix()
cm = cm.table.as_data_frame()
cm
confusionMatrix = ff.create_table(cm)
confusionMatrix.layout.height=300
confusionMatrix.layout.width=800
confusionMatrix.layout.font.size=17
py.iplot(confusionMatrix)


Out[11]:

Business Impact Matrix

Weighting Predictions With a Dollar Value

  • Correctly predicting retain: +$5
  • Correctly predicting churn: +$75
  • Incorrectly predicting retain: -$150
  • Incorrectly predicting churn: -$1.5

In [12]:
CorrectPredictChurn = cm.loc[0,'Churn']
CorrectPredictChurnImpact = 75
cm1 = CorrectPredictChurn*CorrectPredictChurnImpact

IncorrectPredictChurn = cm.loc[1,'Churn']
IncorrectPredictChurnImpact = -5
cm2 = IncorrectPredictChurn*IncorrectPredictChurnImpact

IncorrectPredictRetain = cm.loc[0,'Retain']
IncorrectPredictRetainImpact = -150
cm3 = IncorrectPredictRetain*IncorrectPredictRetainImpact

CorrectPredictRetain = cm.loc[0,'Retain']
CorrectPredictRetainImpact = 5
cm4 = IncorrectPredictRetain*CorrectPredictRetainImpact


data_matrix = [['Business Impact', '($) Predicted Churn', '($) Predicted Retain', '($) Total'],
               ['($) Actual Churn', cm1, cm3, '' ],
               ['($) Actual Retain', cm2, cm4, ''],
               ['($) Total', cm1+cm2, cm3+cm4, cm1+cm2+cm3+cm4]]

impactMatrix = ff.create_table(data_matrix, height_constant=20, hoverinfo='weight')
impactMatrix.layout.height=300
impactMatrix.layout.width=800
impactMatrix.layout.font.size=17
py.iplot(impactMatrix)


Out[12]:

In [13]:
print("Best learner AUC: " + str(baselearner_best_auc))


Best learner AUC: 0.907715225921

In [14]:
print("Super Model AUC: " + str(stack_auc_test))


Super Model AUC: 0.950917068651

In [15]:
print("Total customers evaluated: 534")


Total customers evaluated: 534

In [16]:
print("Total value created by the model: $" + str(cm1+cm2+cm3+cm4))


Total value created by the model: $1440.0

In [17]:
print("Total value per customer: $" +str(round(((cm1+cm2+cm3+cm4)/534),3)))


Total value per customer: $2.697

In [19]:
# Save the Model
h2o.save_model(model=ensemble, force=True)
LoadedEnsemble = h2o.load_model(path='GBM-RF-ensemble')